Symbolic Regression

Modeling Molecular Mass

Delanyce Rose & Richard Henry

2025-07-31

Outline

  • Symbolic Regression
  • Molecular Mass
  • Typical Workflow
  • Toy Dataset
  • Software Implementation
  • Noisy Toy
  • Belly of the Beast
  • Quiet Toy
  • Alternatives to Evolution
  • Data Exploration
  • Modeling and Results
  • Conclusions

Methods

Definition

  • independent variable \(X\)
  • dependent variable \(y\)
  • analytical expression \(f(X)\)
  • \[\hat{y}_{23} \approx f(X_{23})\]

Alternatives - I

  • neural networks
  • \[\frac{dy}{dX}|_{X_{23}}\approx \frac{\hat{y}_{24}-\hat{y}_{23}}{X_{24}-X_{23}}\]
  • linear regression
  • \[\hat{y}_{23} \approx \beta_0 +\beta_1 \cdot X_{23}\]
  • best form?
  • \[\hat{y}_{23} \approx \beta_2 +\beta_3 \cdot e^{X_{23}}\]
  • \[\hat{y}_{23} \approx \beta_4 +\beta_5 \cdot log(X_{23})\]
  • \[log(\hat{y}_{23}) \approx \beta_6 +\beta_7 \cdot log(X_{23})\]
  • Alternating Conditional Expectations
  • \[\hat{y}\approx\frac{\beta_0+\beta_1x+\beta_2x^2}{\beta_3+\beta_4x}\]
  • Kolmogorov-Arnold Network

Typical Workflow

  • what
  • how
  • when
  • initial population
  • fitness evaluation
  • stop ??
  • new population

Toy Dataset

\[\gamma_{API}=\frac{141.5}{\gamma_{o}}-131.5\]

Symbol Meaning Units
\(\gamma_{API}\) API Gravity [degAPI]
\(\gamma_o\) specific gravity [1/wtr]
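
Before fitting anything, the toy data can be generated directly from this definition. A minimal sketch, assuming NumPy and an arbitrary grid of 21 specific gravities (the grid values themselves are illustrative):

import numpy as np

# 21 evenly spaced specific-gravity values spanning a typical liquid range
x = np.linspace(0.60, 1.10, 21)   # specific gravity, gamma_o
y = 141.5 / x - 131.5             # API gravity, gamma_API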

Software Implementation

from pysr import PySRRegressor

myMod = PySRRegressor()

# PySR expects a 2-D feature matrix, hence the reshape
myMod.fit(x.reshape(-1, 1), y)

y_pred = myMod.predict(x.reshape(-1, 1))

Software Implementation - II

myEq=myMod.sympy()

\[x_0-(x_0+0.013196754)+1.0131962+ \frac{x_0 (-132.5)- -141.5}{x_0} \qquad(1)\]

import sympy as sym

sym.simplify(myEq)

\[-131.500000554+\frac{141.5}{x_0}\]

Noisy Toy

w = (np.random.rand(21) - 0.5) * 2.5   # uniform noise in [-1.25, 1.25)
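
Presumably the noise is added to the toy responses before re-fitting; a sketch, reusing x, y and myMod from the previous slides:

y_noisy = y + w                        # perturb the toy API gravities
myMod.fit(x.reshape(-1, 1), y_noisy)   # re-run the search on the noisy data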

\[-132.05688+\frac{141.88547}{x_0}\]

Belly of the Beast (Encoding)

binary_operators=["+","-","*","/"]

unary_operators=None

maxsize=30

Position  1      2       3      4                 5
Token     minus  divide  141.5  specific gravity  131.5

maxdepth=None

Belly of the Beast (Fitness)

elementwise_loss="L2DistLoss()"

\[ \frac{1}{n} \sum_{k=1}^{n}{\left(\hat{y}_k-y_k \right)}^2\]

\[ \frac{1}{n} \sum_{k=1}^{n}{\left|\hat{y}_k-y_k \right|}\]

parsimony=0.0

model_selection="best"

should_simplify=True

tournament_selection_n=15

Belly of the Beast - III

Stopping

niterations=100

max_evals=None

timeout_in_seconds=None

Population

populations=31

population_size=27
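
Putting the options from these slides together, a sketch of the corresponding constructor call (values as listed above; elementwise_loss is the name used by recent PySR releases, older ones call it loss):

from pysr import PySRRegressor

myMod = PySRRegressor(
    binary_operators=["+", "-", "*", "/"],   # encoding
    unary_operators=None,
    maxsize=30,
    maxdepth=None,
    elementwise_loss="L2DistLoss()",         # fitness
    parsimony=0.0,
    model_selection="best",
    should_simplify=True,
    tournament_selection_n=15,
    niterations=100,                         # stopping
    max_evals=None,
    timeout_in_seconds=None,
    populations=31,                          # population
    population_size=27,
)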

Quiet Toy

\[ \frac{a_{00}}{x}+a_{01}\]

w<-seq(from=121.5,to=161.5, by=10)

Table 1: Backus-Naur Description of Toy Problem
<expr> ::= <op>(<expr>, <expr>) | <var> | <con>
<op>   ::= "+" | "-" | "*" | "/"
<var>  ::= x
<con>  ::= 121.5 | 131.5 | 141.5 | 151.5 | 161.5
Table 2: gramEvol Results
Grammatical Evolution Search Results:
  No. Generations:  1531 
  Best Expression:  141.5/x - 131.5 
  Best Cost:        0 

Alternatives to Evolution

Although genetic programming has proven to be the most popular approach to Symbolic Regression (Dong and Zhong 2025), it has been criticized for being slow and producing bloated equations.

Deterministic methods such as brute-force search (Udrescu and Tegmark 2020), mathematical programming (Austel et al. 2017) and sparse regression (Muthyala et al. 2025) have been employed. The last of these is particularly good at removing variables that don’t materially contribute to the prediction. Typically, these methods don’t solve the speed problem, but they are thought to produce more easily interpreted equations.

Methods that rely on prior information have also been used, especially of late. The idea here is to replace the random changes to the equations that are the heart of the evolutionary approach with changes based on what we know about the data. Sometimes this information is learned from the data (Anthes, Sobania, and Rothlauf 2025), and other times it is supplied externally (Keren, Liberzon, and Lazebnik 2023) as “human experience”. In the latter case, there is the risk that the search space may become so restricted that nothing new can be learned.

Neural Network methods have also made an appearance. An early approach (Martius and Lampert 2016) replaced the activation functions in an ANN with transcendental functions such as sine and log. The main idea here is that each layer would host multiple kinds of functions instead of just one. More recent approaches (Kamienny et al. 2022) pre-train a transformer model on synthetic data to learn the relationships between data and equations. The major advantage here is that, if we ignore the pre-training phase, the training phase may be quick enough to use the model in real time.

Next Steps

  1. Examine our dataset and describe some of the equations that have been used historically to calculate molecular mass.

  2. Compare the equations that are generated by PySR for our dataset as the options are varied.

  3. Compare the equations that are generated by evolutionary algorithms vs. deterministic algorithms.

Analysis and Results

Data Exploration and Visualization

Let us take a first look at the dataset from Goossens (Goossens 1996):

We have 3 variables:

Table 3: Goossens Variables
Variable Description Designation
\(Mw\) Molecular Mass dependent
\(SG\) Specific Gravity independent
\(TBP\) True Boiling Point independent

and 70 rows.

Molecular Mass is a proxy for the size of the molecule. For those of you who remember your high school chemistry, it is the mass of the substance per unit mole. Specific Gravity is the density of the substance divided by the density of water. Again, for those of you who remember your high school physics, density is mass per unit volume, so we are expecting these two to be related somehow. True Boiling Point is the temperature at which the substance boils at atmospheric pressure. The true boiling point of water is 100 C.

This is a very small dataset by modern machine learning standards. However, due to the cost of acquiring the molecular mass data, this dataset will be considered “large” by chemical engineering standards.

One of the questions we want to answer is whether datasets like this are “too small” for symbolic regression.

Univariate Analysis

Here is a summary of the dataset:

Table 4: Goossens Dataset Summary
       SG              TBP               MW        
 Min.   :0.6310   Min.   : 306.0   Min.   :  70.0  
 1st Qu.:0.7439   1st Qu.: 373.2   1st Qu.:  99.0  
 Median :0.8474   Median : 584.0   Median : 222.5  
 Mean   :0.8470   Mean   : 575.2   Mean   : 304.6  
 3rd Qu.:0.8784   3rd Qu.: 668.5   3rd Qu.: 297.8  
 Max.   :1.3793   Max.   :1012.0   Max.   :1685.0  

The reader will notice that the specific gravity value range is small compared to the boiling point and molecular mass ranges. Also, the mean of the molecular mass data is nowhere near its median. Let’s take a closer look:

Figure 1: Raw Molecular Mass Histogram

The reader will notice a strongly right-skewed distribution with an empty “bucket” near the top. Let’s take a peek at the box plots:

Figure 2: Raw Molecular Mass Box-and-Whiskers Plot

The reader will notice that all the values greater than about 600 would be considered outliers if the molecular masses were normally distributed.

Turning to the True Boiling Point variable:

Figure 3: Raw Boiling Point Box-and-Whiskers Plot

We appear to have no outliers for this variable, but this variable is not normally distributed either. Let’s check:

Figure 4: Raw Boiling Point Histogram

We appear to have a bimodal distribution here. Let’s compare this with the other predictor, specific gravity:

Figure 5: Raw Specific Gravity Histogram

This distribution is skewed to the right, but not as strongly as molecular mass. No signs of a bimodal distribution here. Here is the box plot:

Figure 6: Raw Specific Gravity Box-and-Whiskers Plot

It is interesting that the largest specific gravity values appear to be outliers, even though this variable appears to be the “most normally distributed” of the three.

Bivariate Analysis

Here is a plot of molecular mass vs. specific gravity:

Figure 7: Molecular Mass vs Specific Gravity Goossens Data

Although there appears to be a clear linear relationship between molecular mass and specific gravity at low gravity numbers, the heteroscedasticity explodes above a gravity of about 0.75.

Notice that the molecular mass “outliers” appear to be different from the specific gravity “outliers”. A reasonable assumption going in was for these two variables to be (more) positively correlated, placing both “outlier” groups in the upper right corner of the plot.

Notice that the largest variation in molecular mass corresponds to the specific gravity range 0.8-0.9, which, according to Figure 5, is our mode.

Here is the molecular mass vs. true boiling point scatter plot:

Figure 8: Molecular Mass vs Boiling Point Goossens Data

There seems to be a monotonically increasing relationship between molecular mass and true boiling point, which matches the expectation going in, unlike Figure 7. There is a possible “pole” around the boiling point of 1000, which we will discuss in our results.

At this point, it may be tempting to ignore the effect of specific gravity on the prediction of molecular mass.

Here we plot the two independent variables against one another:

Figure 9: Specific Gravity vs Boiling Point Goossens Data

This plot suggests that the correlation is poor between true boiling point and specific gravity. Presumably, the specific gravity helps to reduce the scatter around the trend of molecular weight vs. true boiling point.

There is a second, smaller independent dataset available publicly (Hosseinifar and Shahverdi 2021) which we can use for verification. Let us compare it to the Goossens dataset:

Figure 10: Molecular Mass vs Boiling Point. Both Datasets
Figure 11: Molecular Mass vs Specific Gravity. Both Datasets

The two datasets appear to be compatible, even though the variation of the Hosseinifar dataset is significantly smaller.

Existing Correlations

The existing correlations available for estimating molecular mass from true boiling point and specific gravity provide hints as to what kind of equations we expect to see from the algorithm. Many of them predate symbolic regression and therefore were developed by people using intuition and experimentation.

Here are a few of them using the following nomenclature:

Table 5: Correlation Nomenclature
Symbol Meaning
\(M_w\) Apparent Molecular Mass
\(T_b\) True Boiling Point Temperature
\(\gamma_o\) Specific Gravity
\(a_{00}..a_{09}\) Empirical Constants
\(K_w\) Characterization Factor (intermediate value)
\(X_0...X_3\) Intermediate Variables
Hariu & Sage (1969)

\[ M_w = a_{00} + a_{01} K_w + a_{02} K_w^2 + a_{03} T_b K_w + a_{04} T_b K_w^2 + a_{05} T_b^2 K_w + a_{06} T_b^2 K_w^2 \qquad(2)\]

\[K_w =\frac{\sqrt[3]{T_b}}{\gamma_o} \qquad(3)\]

Kesler & Lee (1976)

\[M_w = X_0 + \frac{X_1}{T_b} + \frac{X_2}{T_b^2} \qquad(4)\]

\[X_0 = a_{00} + a_{01} \gamma_o + \left (a_{02} + a_{03} \gamma_o \right ) T_b \qquad(5)\]

\[ X_1 = \left (1+ a_{04} \gamma_o + a_{05} \gamma_o^2 \right ) \left (a_{06} + \frac{a_{07}}{T_b} \right ) \cdot 10^7 \qquad(6)\]

\[ X_2 = \left (1+ a_{08} \gamma_o + a_{09} \gamma_o^2 \right ) \left (a_{10} + \frac{a_{11}}{T_b} \right ) \cdot 10^{12} \qquad(7)\]

American Petroleum Institute (1977)

\[ M_w = a_{00} e^{\left (a_{01} T_b \right )} e^{\left (a_{02} \gamma_o \right )} T_b^{a_{03}} \gamma_o^{a_{04}} \qquad(8)\]

Winn, Sim & Daubert (1980)

\[M_w = a_{00} T_b^{a_{01}} \gamma_o^{a_{02}} \qquad(9)\]

Riazi & Daubert (1980)

\[M_w = a_{00} T_b^{a_{01}} \gamma_o^{a_{02}} \qquad(10)\]

Rao & Bardon (1985)

\[ln(M_w) = (a_{00} + a_{01} K_w) \, ln \left (\frac{T_b}{a_{02} + a_{03} K_w} \right ) \qquad(11)\]

See Equation 3

Riazi & Daubert (1987)

\[ M_w = a_{00} T_b^{a_{01}} \gamma_o^{a_{02}} e^{\left (a_{03} T_b + a_{04} \gamma_o + a_{05} T_b \gamma_o \right )} \qquad(12)\]

Goossens (1996)

\[M_w = a_{00} T_b^{X_0} \qquad(13)\]

\[ X_0 =\frac {a_{03} + a_{04} \, ln {\left (\frac{T_b}{a_{05} - T_b} \right )}} {a_{01} \gamma_o + a_{02}} \qquad(14)\]

Linan (2011)

\[ M_w = a_{00} e^{\left (a_{01} T_b \right )} e^{\left (a_{02} \gamma_o \right )} T_b^{a_{03}} \gamma_o^{a_{04}} \qquad(15)\]

Hosseinifar & Shahverdi (2021)

\[M_w = {\left [a_{00} T_b^{a_{01}} {\left (\frac{3+2\gamma_o}{3-\gamma_o} \right )}^{\frac{a_{02}}{2}} + a_{03} T_b^{a_{04}} {\left (\frac{3+2\gamma_o}{3-\gamma_o} \right )}^{\frac{a_{05}}{2}} \right ]}^{a_{06}} \qquad(16)\]

Stratiev (2023)

\[ M_w = a_{00} + a_{01} e^{\left [a_{02} e^{\left (a_{03} \frac{T_b^{a_{06}}}{\gamma_o^{a_{05}}} \right )} \right ]} \qquad(17)\]

The reader will notice that all the correlations are non-linear, and that only a few of them can be easily transformed into a linear relationship. Some of the additional operators we may want to consider from an inspection of these equations include:

Table 6: Partial List of Additional Operators
Operator Type Description
pow binary one expression raised to the power of another
log unary logarithm of an expression
exp unary antilogarithm of an expression
sqr unary expression squared
cub unary expression cubed
inv unary inverse of an expression
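
For reference, a sketch of how the operators in Table 6 might be handed to PySR. log, exp, square and cube are built-in unary operator names and "^" is the built-in power operator; inv is written as a custom Julia operator (the inline-definition syntax and extra_sympy_mappings are assumptions about the PySR version in use):

from pysr import PySRRegressor

myMod = PySRRegressor(
    binary_operators=["+", "-", "*", "^"],
    unary_operators=["log", "exp", "square", "cube", "inv(x) = 1/x"],
    extra_sympy_mappings={"inv": lambda x: 1 / x},   # so .sympy() understands inv
)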

Modeling and Results

In this sub-section, we will apply a series of symbolic regression models to our Goossens dataset, and look at the structure and performance of each generated equation.

Default Run

Our first experiment is to run PySR with default parameters against our molecular mass dataset. The resulting equation looks nothing like the existing equations:

\[ M_w=a_{00}+\frac{a_{01}}{\gamma_o-a_{02}}+\frac{T_b\cdot (a_{03}\cdot T_b-a_{04})}{\gamma_o\cdot (a_{05}-\frac{a_{06}}{T_b})} \qquad(18)\]

But its performance is quite impressive considering its restricted grammar of addition, subtraction, multiplication and division:

Figure 12: Default Parameter Run

It even has a (slightly) better correlation coefficient with the raw molecular mass data than Equation 13 presented with this data in the Goossens paper:

Table 7: Default Run Correlation Coefficients
                   Raw SG    Raw TBP   Raw MW    Goossens Equation  This Equation
Raw SG             1.000000  0.625218  0.334852  0.339126           0.337190
Raw TBP            0.625218  1.000000  0.869591  0.868486           0.871037
Raw MW             0.334852  0.869591  1.000000  0.999711           0.999847
Goossens Equation  0.339126  0.868486  0.999711  1.000000           0.999798
This Equation      0.337190  0.871037  0.999847  0.999798           1.000000
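
The correlation matrices reported in these tables can be reproduced with pandas; a sketch, assuming a DataFrame df holding the raw columns and the two sets of predictions (the column names are illustrative):

import pandas as pd

corr = df[["SG", "TBP", "MW", "MW_goossens", "MW_default"]].corr()
print(corr.round(6))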

There are some challenges. The constant \(a_{02}\) is only slightly larger than the largest \(\gamma_o\) (specific gravity) in the dataset. This limits the equation’s ability to extrapolate to specific gravities higher than those seen in this dataset, and extrapolation is one of the strengths of Symbolic Regression (Sahoo, Lampert, and Martius 2018).

Next, we look at the residuals, plotted against the raw molecular mass:

Figure 13: Default Run Residuals

Visually, the residuals appear to be normally distributed, but running the Shapiro-Wilk test reveals a probability of only 0.001 of the residuals being normally distributed. So much for eye-balling!
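
The test itself is a one-liner; a sketch, assuming the residuals are held in an array resid:

from scipy import stats

W, p_value = stats.shapiro(resid)   # p_value is about 0.001 for the default-run residuals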

Let us see how bad it is:

Figure 14: Q-Q Plot for the Default Run Residuals

The plot suggests that small residuals (say \(\pm 2.5\)) are normally distributed, but the large residuals are not.

Out of mere curiosity, let us examine the residuals from the Goossens correlation (Equation 13):

Figure 15: Goossens Correlation Residuals

This looks largely similar to the default-run residuals, until it is noticed that the zero line does not bisect the plot. The Shapiro-Wilk test p-value is zero to six significant figures.

Power Run

Our second experiment is to drop the “division” binary operator \(\left (\frac{a}{b} \right )\) and add the “power” binary operator \(\left (a^b \right )\), as the latter is a popular component of the existing correlations.

Visually, there is a noticeable drop in the quality of the match:

Figure 16: Power Operator Run

Although the equation has become simpler:

\[M_w=a_{00} \cdot a_{01}^{\gamma_o^{-a_{02}} + a_{03} \cdot T_b} + a_{04} \qquad(19)\]

and the correlation coefficient has degraded slightly:

Table 8: Power Operator Run Correlation Coefficients
Raw SG Raw TBP Raw MW This Equation
Raw SG 1.000000 0.625218 0.334852 0.325151
Raw TBP 0.625218 1.000000 0.869591 0.868747
Raw MW 0.334852 0.869591 1.000000 0.997281
This Equation 0.325151 0.868747 0.997281 1.000000

One can argue that nested exponents \(\left (a^{b^c} \right )\) aren’t very explainable. However, the Stratiev correlation (Equation 17) does this with Euler’s number \(\left ( a \cdot e^{b \cdot e^{c}} \right )\). In the hydrocarbon flow literature, nested logarithms \(ln(ln(a))\) are more common.

Exponential Run

Our third experiment is to replace the binary power operator with the unary logarithm and exponential operators. Our initial results were a major surprise, as neither transcendental operator made it into the final equation:

\[ M_w= - T_b \cdot \left(a_{00} \cdot T_b - a_{01}\right) \left(a_{02} \cdot 10^{-6} \left(a_{03} \cdot \gamma_o - 2 \cdot T_b \right) \left(T_b - a_{04}\right) - 1\right) + a_{05} \qquad(20)\]

Accordingly, we reset the random number generator and tried again:

\[ M_w= a_{00}\cdot T_b + a_{01} \cdot e^{- \gamma_o^{2} + a_{02}\cdot \gamma_o + a_{03} \cdot T_b} \qquad(21)\]

This is a dramatically different equation. This experience suggests that the real reason the deterministic algorithms are still under active development today is to avoid this kind of ambiguity.

Our correlation coefficients are still not as good as the default run:

Table 9: Exponential Run Correlation Coefficients
                 Raw SG    Raw TBP   Raw MW    1st Equation  2nd Equation
Raw SG           1.000000  0.625218  0.334852  0.348819      0.344733
Raw TBP          0.625218  1.000000  0.869591  0.869577      0.867574
Raw MW           0.334852  0.869591  1.000000  0.997705      0.998420
First Equation   0.348819  0.869577  0.997705  1.000000      0.999497
Second Equation  0.344733  0.867574  0.998420  0.999497      1.000000

But the risk of “division by zero” errors during extrapolation is negligible.

Also notable in this run is the absence of the logarithm function to linearize the relationship between these two variables:

Figure 17: Raw Data Scale Change

This “obvious” human observation appears not to be an optimum transformation, or was eliminated by the evolutionary algorithm before it could reach its full potential.

Aeon Run

The fourth experiment is the same as the third experiment, except that we run it for ten times as long. Here, we are interested in whether different starting points will eventually converge to similar equations.

Here is the first equation after a thousand iterations:

\[ M_w=(a_{00}\cdot T_b-a_{01})e^{a_{02}\cdot 10^{-9}T_b^2(a_{03}\cdot \gamma_o +a_{04}\cdot T_b)} \qquad(22)\]

And the second equation:

\[ M_w=-a_{00}\cdot \gamma_o+\frac{e^{-\gamma_o^2+{log(T_b)}^3}}{T_b^{a_{01}}} +a_{02}\cdot T_b \qquad(23)\]

The correlation coefficients, however, now rival the default run:

Table 10: Aeon Run Correlation Coefficients
Raw Mw First Equation Second Equation
Raw Mw 1.000000 0.998880 0.999324
First Equation 0.998880 1.000000 0.998851
Second Equation 0.999324 0.998851 1.000000

Both equations have evolved significantly by adding 900 iterations, but “similar, they are not” as everyone’s favourite 900 year old would say.

The second equation is truly fascinating for two reasons. Neither the power operator \(\left (a^b \right )\) nor the division operator \(\left (\frac{a}{b} \right)\) were provided to the algorithm for this run, and yet we have found the middle expression \(\left( \frac{e^{-\gamma_o^2+{log(T_b)}^3}}{T_b^{a_{01}}} \right)\).

Box-Cox Run

The reader may remember that our molecular mass (Figure 1) was not normally distributed. Accordingly, here we transform our dependent variable to see if this improves the distribution of our residuals. Our optimum exponent works out to be \(-0.3624\) according to the python stats library, which is fairly close to zero.
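
A sketch of the transform, assuming SciPy and that the molecular masses are in an array mw; stats.boxcox also returns the exponent that maximizes the log-likelihood (reported above as roughly -0.3624):

from scipy import stats

mw_t, lam = stats.boxcox(mw)   # transformed values and the fitted lambda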

Our transformed molecular mass distribution is hardly normal:

Figure 18: transformed molecular mass histogram

The other thing the reader may have noticed in Figure 18 is that the transformed variable range is much narrower. If we round up to an exponent of zero, then our transform becomes logarithmic:

Figure 19: transformed molecular mass

This expands the range of the dependent variable but probably makes it less Gaussian. Let us go in the opposite direction and round down to an inverse square root:

Figure 20: transformed molecular mass

This appears to compress the range of the transformed variable even more. Let us stay with \(\lambda=-0.3624\) and press on. Using all four basic binary operators (like the default case) and adding exponentials and logarithms (like the exponential case), our fit looks quite good in Box-Cox space:

Figure 21: transformed predicted vs. actual, Box-Cox Run

The residuals in Box-Cox space are more spread-out:

Figure 22: transformed Residuals, Box-Cox Run

And the Shapiro-Wilk test puts the probability of the residuals being normal at 0.0507, which, although higher than before, is barely above a significance level of 0.05. When we untransform the predictions and compare them to the measured molecular masses, the fit looks a little smoother:

Figure 23: Untransformed predicted vs. actual, Box-Cox Run

This run produced the same equation structure at 100, 500 and 1000 iterations, which was not the case for the PySR runs in the Aeon experiment above:

\[ M_w^t = \left (\gamma_o \left(a_{00} T_b -a_{01} \right) log{(\gamma_o)} + a_{02} \right) log{(T_b)} \qquad(24)\]

We do need to remember to reverse the Box-Cox transform:

\[ M_w = {\left(\lambda M_w^t +1 \right)}^{\frac{1}{\lambda}} \qquad(25)\]
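
In code, SciPy provides this inverse directly; a sketch, assuming mw_t_pred holds the predictions in Box-Cox space and lam is the exponent fitted earlier:

from scipy.special import inv_boxcox

mw_pred = inv_boxcox(mw_t_pred, lam)   # Equation 25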

The correlation coefficient for the longest run (1000 iterations) is in line with the Aeon runs discussed earlier:

Table 11: Box-Cox Run Correlation Coefficients (Transform Reversed)
Raw Mw This Equation
Raw Mw 1.000000 0.995897
This Equation 0.995897 1.000000

A quick look at the residuals between the reverse-transformed predictions and the raw molecular masses tells the story:

Figure 24: Residuals outside of Box-Cox Space, Box-Cox Run

Residuals are small for molecular masses below about 350, but explode above 1000. Statistically speaking, this may be good if we are more uncertain about the accuracy of the larger molecular masses, but the engineering point of view is different for exactly the same reason. Larger deviation from existing datapoints in regions where the datapoints are scarce is risky business.

First Sparse Run

It would seem remiss not to look at at least one of the deterministic models. The first library to work without requiring long-deprecated versions of Python is SyMANTIC (Muthyala et al. 2025), so we set up the run to be equivalent to the Default Run above. Our “winning” equation is:

\[ M_w=-a_{00} \cdot \gamma_o +\frac{a_{01}\cdot \gamma_o}{T_b}+a_{02}\cdot T_b -a_{03} \qquad(26)\]

Compare this to Equation 23. Similar, but much simpler. The quality of the fit, however, is disappointing:

Table 12: First Sparse Run Correlation Coefficients
Raw Mw This Equation
Raw Mw 1.000000 0.970451
This Equation 0.970451 1.000000

A pleasant surprise is that we will have no “division by zero” problems with this equation as \(T_b\) is in Kelvin.

A sparse regression model, “under the hood” is just a linear regression model. Accordingly, even though we are not thrilled with the fit, maybe the residuals are normal…

Figure 25: 1st Sparse Run Residuals

Never mind. Even if these residuals are normal (they are not) they are definitely not random. Moving right along…

Second Sparse Run

This leads to a second deterministic experiment, in which we add the logarithm and exponential functions but leave the division operator in place. This is the same operator set used in the Box-Cox run above.

Our new equation is not complex either:

\[ M_w=-a_{00}\cdot \gamma_o\cdot T_b +a_{01}\cdot T^2_b +a_{02}\cdot e^{\gamma_o} +a_{03} \qquad(27)\]

And our fit improves slightly:

Table 13: Second Sparse Run Correlation Coefficients
Raw Mw
Raw Mw 1.000000
1st Run Equation 0.970451
2nd Run Equation 0.985800

But our plot of measured molecular mass versus predicted molecular mass is not impressive at all:

Figure 26: Second Sparse Run

Validation Runs

As promised, we used the Hosseinifar dataset of 10 records (12.5%) to validate some of these equations:

Table 14: Validation Run Correlation Coefficients
Run Correlation Coefficient
Raw Mw 1.000000
Default 0.999957
Goossens Correlation 0.999939
Power 0.996964
Exponential #2 0.998954
Aeon #1 0.999973
Aeon #2 0.999921
Box-Cox 0.999696
Sparse #1 0.992303
Sparse #2 0.996331

All of these are very good, with the validation coefficients exceeding the training coefficients. We will resist the temptation to declare victory over overfitting, as the Hosseinifar dataset is far better behaved than the Goossens dataset.

A better approach with the evolutionary model is probably to start increasing parsimony or decreasing maxsize or maxdepth until the correlation coefficients start to deteriorate.
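
In PySR terms this amounts to re-running with a tighter complexity budget; a sketch with illustrative values:

from pysr import PySRRegressor

myMod = PySRRegressor(
    parsimony=0.005,   # penalize complexity in the score
    maxsize=20,        # fewer nodes per equation
    maxdepth=8,        # shallower expression trees
)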

A different approach is warranted with the deterministic models. It is possible that these are underfit, meaning that we can expand the number of unique terms the algorithm initially considers, or the number of terms we want to keep.

General Observations

The “UOP Characterization Factor” (Equation 3), which (at least) two researchers decided was an important correlating variable for molecular mass, was not discovered by our symbolic regression model.

In fairness, however, we dropped the division operator early in our experimentation, and never asked the model to consider a “cube root” unary operator.

Replication is very challenging with these symbolic regression libraries that use evolutionary algorithms. Re-running the same dataset on the same machine with the same libraries produces different equations. In many Scikit-Learn algorithms, answer variation can be minimized by seeding the random number generator the same way each time, but in PySR we also have to turn off the optimizations that allow Julia to run the calculations massively in parallel, which contributes to the criticism that these algorithms are slow.
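
A hedged sketch of the settings involved (parameter names as documented for PySR around version 0.16; newer releases expose a single parallelism="serial" option instead of procs/multithreading):

from pysr import PySRRegressor

myMod = PySRRegressor(
    random_state=0,         # seed the search
    deterministic=True,     # demand a reproducible run
    procs=0,                # no Julia worker processes
    multithreading=False,   # no Julia threads
)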

The deterministic symbolic regression libraries don’t need random number seeds, and the answers stay the same between runs. The equations are definitely more explainable, but appear to be less accurate. Does this accuracy matter? Is it real? The answers to these questions depend on how good we think our measurements are!

In the engineering world, an old trick to speed-up calculations is to use look-up tables instead of functions. The idea is that even though we may know exactly what the governing equations are, producing look up tables from these functions ahead of time and then applying Gaussian interpolation in real time may be faster. This is a very similar concept to the idea of pre-training a transformer network to do symbolic regression.

This paper has shown that high fidelity models can be constructed from “cheap” components (i.e. arithmetic operators). Maybe explainability is overrated, and functions carefully built using symbolic regression may be competitive with old-man Gauss!

If this turns out to be true, then symbolic regression will have come full circle and returned to its roots.

Conclusions

In summary, we have explored Symbolic Regression using evolutionary and deterministic algorithms on our molecular mass dataset. We have not spent much time on different kinds of engines (e.g. neural networks, mathematical programming) or tested the extrapolation chops of this technology.

Our main conclusions are:

  1. Symbolic Regression can generate reasonable predictions when trained on datasets too small for neural networks.

  2. Evolutionary Symbolic Regression will typically generate novel equations from the human point of view, but these equations can be as accurate as the ones created by humans.

  3. The “explainability” of the generated equations is probably overrated, unless enough guardrails are put up to constrain the space searched by the evolutionary Symbolic Regression algorithm.

  4. The ability of Evolutionary Symbolic Regression to easily find alternative equations of about the same accuracy means that humans can repeat the workflow until an equation that is more “explainable” than the others is generated.

  5. The structure of an equation generated with symbolic regression will withstand some noise.

  6. Piping the “winning” equation through a symbolic mathematics package to simplify and consolidate it is a good idea. The effect on the deterministic equations is very slight.

  7. This consolidation often introduces relationships between variables and constants that were not specified before the algorithm was started. This can nudge the human to reconsider relationships previously rejected.

  8. Not all Symbolic Regression libraries using evolutionary algorithms are created equal. Library comparison is beyond the scope of this report, but we can comment that PySR was one of the better ones.

  9. Running both evolutionary and deterministic models on the same problem is also a good idea.

References

Anthes, Philipp, Dominik Sobania, and Franz Rothlauf. 2025. “Transformer Semantic Genetic Programming for Symbolic Regression.” arXiv Preprint arXiv:2501.18479.
Austel, Vernon, Sanjeeb Dash, Oktay Gunluk, Lior Horesh, Leo Liberti, Giacomo Nannicini, and Baruch Schieber. 2017. “Globally Optimal Symbolic Regression.” arXiv Preprint arXiv:1710.10720.
Dong, Junlan, and Jinghui Zhong. 2025. “Recent Advances in Symbolic Regression.” ACM Computing Surveys 57 (11): 1–37.
Goossens, Adriaan G. 1996. “Prediction of Molecular Weight of Petroleum Fractions.” Industrial & Engineering Chemistry Research 35 (3): 985–88.
Hosseinifar, Pouya, and Hamidreza Shahverdi. 2021. “A Predictive Method for Constructing the Distillation Curve of Petroleum Fluids Using Their Physical Bulk Properties.” Journal of Petroleum Science and Engineering 200: 108403.
Kamienny, Pierre-Alexandre, Stéphane d’Ascoli, Guillaume Lample, and François Charton. 2022. “End-to-End Symbolic Regression with Transformers.” Advances in Neural Information Processing Systems 35: 10269–81.
Keren, Liron Simon, Alex Liberzon, and Teddy Lazebnik. 2023. “A Computational Framework for Physics-Informed Symbolic Regression with Straightforward Integration of Domain Knowledge.” Scientific Reports 13 (1): 1249.
Martius, Georg, and Christoph H Lampert. 2016. “Extrapolation and Learning Equations.” arXiv Preprint arXiv:1610.02995.
Muthyala, Madhav R, Farshud Sorourifar, You Peng, and Joel A Paulson. 2025. “SyMANTIC: An Efficient Symbolic Regression Method for Interpretable and Parsimonious Model Discovery in Science and Beyond.” Industrial & Engineering Chemistry Research 64 (6): 3354–69.
Sahoo, Subham, Christoph Lampert, and Georg Martius. 2018. “Learning Equations for Extrapolation and Control.” In International Conference on Machine Learning, 4442–50. PMLR.
Udrescu, Silviu-Marian, and Max Tegmark. 2020. “AI Feynman: A Physics-Inspired Method for Symbolic Regression.” Science Advances 6 (16): eaay2631.